Abstract

A popular approach to significance testing proposes to decide whether the given
hypothesized statistical model is likely to be true (or false). Statistical decision theory
provides a basis for this approach by requiring every significance test to make a decision
about the truth of the hypothesis/model under consideration. Unfortunately, many
interesting and useful models are obviously false (that is, not exactly true) even before
considering any data. Fortunately, in practice a significance test need only gauge the
consistency (or inconsistency) of the observed data with the assumed hypothesis/model,
without enquiring as to whether the assumption is likely to be true (or false), or
whether some alternative is likely to be true (or false). In this practical formulation, a
significance test rejects a hypothesis/model only if the observed data is highly improbable
when calculating the probability while assuming the hypothesis being tested; the
significance test only gauges whether the observed data likely invalidates the assumed
hypothesis, and cannot decide that the assumption, however unmistakably false,
is likely to be false a priori, without any data.

Essentially, all models are wrong, but some are useful. — G. E. P. Box 
1 Introduction



As pointed out in the above quotation of G. E. P. Box, many interesting models are false 
(that is, not exactly true), yet are useful nonetheless. Significance testing helps measure 
the usefulness of a model. Testing the validity of using a model for virtually any purpose 
requires knowing whether observed discrepancies are due to inaccuracies or inadequacies in 
the model or (on the contrary) could be due to chance arising from necessarily finite sample 
sizes. Significance tests gauge whether the discrepancy between the model and the observed 
data is larger than expected random fluctuations; significance tests gauge the size of the 
unavoidable random fluctuations. 

A traditional approach, along with its modern formulation in statistical decision theory, 
tries to decide whether a hypothesized model is likely to be true (or false). However, in 
many practical circumstances, a significance test need only gauge the consistency (or
inconsistency) of the observed data with the assumed hypothesis/model, without ever enquiring
as to whether the assumption is likely to be true (or false), or whether some alternative is
likely to be true (or false). In this practical formulation, a significance test rejects a
hypothesis/model only if the observed data is highly improbable when calculating the probability
while assuming the hypothesis being tested. Whether or not the assumption could be exactly 
true in reality is irrelevant. 

An illustrative example may help clarify. When testing the goodness of fit for the Poisson
regression where the distribution of $Y$ given $x$ is the Poisson distribution of mean
$\exp(\theta^{(0)} + \theta^{(1)} x + \theta^{(2)} x^2 + \theta^{(3)} x^3)$, the conventional Neyman-Pearson null hypothesis is

$H_0^{NP}$: there exist real numbers $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \theta^{(3)}$ such that $y_1, y_2, \dots, y_n$ are independent
draws from the Poisson distributions with means $\mu_1, \mu_2, \dots, \mu_n$, respectively, (1)

where

$$\ln(\mu_k) = \theta^{(0)} + \theta^{(1)} x_k + \theta^{(2)} (x_k)^2 + \theta^{(3)} (x_k)^3 \tag{2}$$

for $k = 1, 2, \dots, n$, and the observations $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ are ordered pairs of
scalars (real numbers paired with nonnegative integers). A related but perhaps simpler null
hypothesis is

$H_0$: $y_1, y_2, \dots, y_n$ are independent draws from the Poisson distributions
with means $\mu_1, \mu_2, \dots, \mu_n$, respectively, (3)

where

$$\ln(\mu_k) = \hat\theta^{(0)} + \hat\theta^{(1)} x_k + \hat\theta^{(2)} (x_k)^2 + \hat\theta^{(3)} (x_k)^3 \tag{4}$$

for $k = 1, 2, \dots, n$, with $\hat\theta$ being a maximum-likelihood estimate. Needless to say, even if
the observed data really does arise from Poisson distributions whose means are exponentials
of a cubic polynomial, the particular values $\hat\theta^{(0)}, \hat\theta^{(1)}, \hat\theta^{(2)}, \hat\theta^{(3)}$ of the parameters of the fitted
polynomial will almost surely not be exactly equal to the true values. Even though the
estimated values of the parameters may not be exactly correct, it still makes good sense to
enquire as to whether the fitted cubic polynomial is consistent with the data up to random
fluctuations inherent in using a finite amount of observed data.

In fact, since subsequent use of the model usually involves the particular fitted polynomial,
whose specification includes the observed parameter estimates, analyzing the model
including the estimated values of the parameters makes more sense than trying to decide 
whether the data really did come from Poisson distributions whose means are exponentials 
of some unspecified cubic polynomial. For instance, any plot of the fit (such as a plot of the 
means of the Poisson distributions) must use the estimated values of the parameters, and 
any statistical interpretation of the plot should also depend explicitly on the estimates; 
a significance test can gauge the consistency of the plotted fit with the observed data, 
without ever asking whether the plotted fit is the truth (it is almost surely not identical 
to the underlying reality) and without making some decision about an abstract family of 
polynomials which may or may not include both the plotted fit and the underlying reality. 

A popular measure of divergence from the null hypothesis is the log-likelihood-ratio 

$$g^2 = 2 \sum_{k=1}^{n} y_k \ln(y_k / \mu_k). \tag{5}$$

A P-value (see, for example, Section 3 below) quantifies whether this divergence is larger than
expected from random fluctuations inherent in using only $n$ data points. It is not obvious how
to calculate an exact P-value for $H_0^{NP}$ from (1) and (2), which refers to cubic polynomials
with undetermined coefficients. In contrast, $H_0$ from (3) and (4) refers explicitly to the
particular fitted value $\hat\theta$; $H_0$ concerns the particular fit displayed in a plot, and is natural for
the statistical interpretation of such a plot.
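
As a minimal sketch of how the log-likelihood-ratio divergence might be evaluated once fitted means are available, consider the following. The function name `log_likelihood_ratio` and the numerical values are hypothetical, chosen only for illustration; the convention that a term with $y_k = 0$ contributes zero is an assumption made here, not something stated in the text.

```python
import math


def log_likelihood_ratio(ys, mus):
    """Return g^2 = 2 * sum_k y_k * ln(y_k / mu_k).

    Assumption: a term with y_k == 0 contributes zero (0 * ln 0 := 0).
    """
    if len(ys) != len(mus):
        raise ValueError("ys and mus must have the same length")
    return 2.0 * sum(y * math.log(y / mu) for y, mu in zip(ys, mus) if y > 0)


# Hypothetical counts and fitted Poisson means, purely for illustration.
ys = [1, 2, 2, 4, 9]
mus = [1.2, 1.6, 2.5, 4.4, 8.3]
g2 = log_likelihood_ratio(ys, mus)
```

The value of `g2` is meaningful only relative to its distribution under the assumed hypothesis, which is what a P-value summarizes.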

Thus, when calculating significance, the assumed model should include the particular 
values of any parameters estimated from the observed data. Such parameters are known as
"nuisance" parameters. As illustrated with $H_0$ from (3) and (4), the assumed hypothesis
will be "simple" in the Neyman-Pearson sense, but will depend on the observed values of
the parameters (that is, the hypothesis will be "data-dependent"); the hypothesis will be
"random." Including the particular values of the parameters estimated from the observed
data replaces the "composite" hypothesis of the conventional Neyman-Pearson formulation
with a "simple" data-dependent hypothesis. As discussed in Section 4 below, fully conditional
tests also incorporate the observed values of the parameters, but make the extra assumption
that all possible realizations of the experiment (observed or hypothetical) generate the
same observed values of the parameters. The device of a "simple data-dependent hypothesis"
such as $H_0$ incorporates the observed values explicitly without the extra assumption.

For most purposes, a parameterized model is not really operational — that is, suitable 
for making precise predictions — until its specification is completed via the inclusion of 
estimates for any nuisance parameters. The results of the significance tests considered below 
depend on the quality of both the models and the parameter estimators. However, the results 
are relatively insensitive to the particular observed realizations of the parameter estimators 
(that is, to the parameter estimates) unless specifically designed to quantify the quality of 
the parameter estimates. To quantify the quality of the parameter estimates, we recommend 
testing separately the goodness of fit of the parameter estimates, using confidence intervals, 
confidence distributions, parametric bootstrapping, or significance tests within parametric 
models, whose statistical power is focused against alternatives within the parametric family 
constituting the model (for further discussion of the latter, see Section 5 below).

The remainder of the present article has the following structure: Section 2 very briefly
discusses Bayesian-frequentist hybrids, referring for details to the definitive work of Gelman
(2003). Section 3 defines P-values (also known as "attained significance levels"), which
quantify the consistency of the observed data with the assumed models. Section 4 details
several approaches to testing the goodness of fit for distributional profile. Section 5 discusses
testing the goodness of fit for various properties beyond just distributional profile.



Cox (2006) details many advantages of interpreting significance as gauging the consistency
of an assumption/hypothesis with observed data, rather than as making decisions about the
actual truth of the assumption. However, significance testing is meaningless without any
observations, unlike purely Bayesian methods, which can produce results without any data,
courtesy of the prior (the prior is the statistician's system of a priori beliefs, accumulated
from prior experience, law, morality, religion, etc., without reference to the observed data).
Significance tests are deficient in this respect. Those interested in what is to be considered
true in reality and in making decisions more generally should use Bayesian and sequential
(including multilevel) procedures. Significance testing simply gauges the consistency of models
with observed data; generally significance testing alone cannot handle the truth.



2 Bayesian versus frequentist 

Traditionally, significance testing is frequentist. However, there exist Bayesian-frequentist
hybrids known as "Bayesian P-values"; Gelman (2003) sets forth a particularly appealing
formulation. Bayesian P-values test the consistency of the observed data with the model
used together with a prior for nuisance parameters. In contrast, the P-values discussed in 
the present paper test the consistency of the observed data with the model used together 
with a parameter estimator. In the Bayesian formulation, a P-value depends explicitly on 
the choice of prior; in the formulation of the present paper, a P-value depends explicitly on 
the choice of parameter estimator. Thus, when there are nuisance parameters, the two types 
of P-values test slightly different hypotheses and provide slightly different information; each 
type is ideal for its own set-up. Of course, if there are no nuisance parameters, then Bayesian 
P-values and the P-values discussed below are the same. 



3 P-values 

A P-value for a hypothesis $H_0$ is a statistic such that, if the P-value is very small, then
we can be confident that the observed data is inconsistent with assuming $H_0$. The P-value
associated with a measure of divergence from a hypothesis $H_0$ is the probability that $D \ge d$,
where $d$ is the divergence between the observed and the expected (with the expectation
following $H_0$ for the observations), and $D$ is the divergence between the simulated and the
expected (with the expectation following $H_0$ for the simulations, and with the simulations
performed assuming $H_0$). When taking the probability that $D \ge d$, we view $D$ as a random
variable, while viewing $d$ as fixed, not random. For example, when testing the goodness of
fit for the model of i.i.d. draws from a probability distribution $p_0(\theta)$, where $\theta$ is a nuisance
parameter that must be estimated from the data, that is, from observations $x_1, x_2, \dots, x_n$,
we use the null hypothesis

$H_0$: $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\hat\theta)$, where $\hat\theta = \hat\theta(x_1, x_2, \dots, x_n)$. (6)

The P-value for $H_0$ associated with a divergence $\delta$ is the probability that $D \ge d$, where

• $d = \delta(\hat p, p_0(\hat\theta))$,

• $\hat p$ is the empirical distribution of $x_1, x_2, \dots, x_n$,

• $\hat\theta$ is the parameter estimate obtained from the observed draws $x_1, x_2, \dots, x_n$,

• $D = \delta(\hat P, p_0(\hat\Theta))$,

• $\hat P$ is the empirical distribution of i.i.d. draws $X_1, X_2, \dots, X_n$ from $p_0(\hat\theta)$, and

• $\hat\Theta$ is the parameter estimate obtained from the simulated draws $X_1, X_2, \dots, X_n$.

If the P-value is very small, then we can be confident that the observed data is inconsistent
with assuming $H_0$. Examples of divergences include $\chi^2$ (for categorical data) and the
maximum absolute difference between cumulative distribution functions (for real-valued data).
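
For categorical data, one such divergence might be sketched as follows. The text does not say which form of the chi-square statistic is intended, so the Pearson form (observed versus expected counts) is an assumption here, and the name `chi_square_divergence` is hypothetical.

```python
def chi_square_divergence(counts, model_probs):
    """Pearson chi-square: sum_j (O_j - E_j)^2 / E_j, with E_j = n * p_j.

    counts      -- observed counts O_j in each bin
    model_probs -- model probabilities p_j (assumed strictly positive)
    """
    if len(counts) != len(model_probs):
        raise ValueError("counts and model_probs must have the same length")
    n = sum(counts)
    if n == 0:
        raise ValueError("at least one observation is required")
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, model_probs))
```

Equivalently, in terms of the empirical proportions $\hat p_j = O_j / n$, this quantity is $n \sum_j (\hat p_j - p_j)^2 / p_j$.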

Remark 3.1. To compute the P-value assessing the consistency of the experimental data
with assuming $H_0$, we can use Monte-Carlo simulations (very similar to those of Clauset et al.
(2009)). First, we estimate the parameter $\theta$ from the $n$ given experimental draws, obtaining
$\hat\theta$, and calculate the divergence between the empirical distribution and $p_0(\hat\theta)$. We then run
many simulations. To conduct a single simulation, we perform the following three-step
procedure:

1. we generate $n$ i.i.d. draws according to the model distribution $p_0(\hat\theta)$, where $\hat\theta$ is the
estimate calculated from the experimental data,

2. we estimate the parameter $\theta$ from the data generated in Step 1, obtaining a new
estimate $\hat\Theta$, and

3. we calculate the divergence between the empirical distribution of the data generated in
Step 1 and $p_0(\hat\Theta)$, where $\hat\Theta$ is the estimate calculated in Step 2 from the data generated
in Step 1.

After conducting many such simulations, we may estimate the P-value for assuming $H_0$
as the fraction of the divergences calculated in Step 3 that are greater than or equal to
the divergence calculated from the empirical data. The standard error of the estimated P-value is
inversely proportional to the square root of the number of simulations conducted; for details,
see Remark 3.2 below.
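
The three-step procedure above can be sketched in code under some illustrative assumptions that are not in the text: the model $p_0(\theta)$ is taken to be the exponential distribution with rate $\theta$, the estimator is the maximum-likelihood estimate (the reciprocal of the sample mean), and the divergence is the maximum absolute difference between the empirical and model cumulative distribution functions. All function names are hypothetical.

```python
import math
import random


def fit_rate(xs):
    """Maximum-likelihood estimate of the exponential rate: 1 / sample mean."""
    return len(xs) / sum(xs)


def max_cdf_difference(xs, cdf):
    """Maximum absolute difference between the empirical CDF of xs and cdf."""
    xs = sorted(xs)
    n = len(xs)
    return max(max((i + 1) / n - cdf(x), cdf(x) - i / n)
               for i, x in enumerate(xs))


def divergence_from_fit(xs):
    """Estimate the rate from xs, then compare xs against the fitted model."""
    rate = fit_rate(xs)
    return max_cdf_difference(xs, lambda x: 1.0 - math.exp(-rate * x))


def monte_carlo_p_value(xs, num_sims=1000, seed=0):
    """Fraction of simulated divergences >= the observed divergence."""
    rng = random.Random(seed)
    n = len(xs)
    rate_hat = fit_rate(xs)          # estimate from the experimental data
    d = divergence_from_fit(xs)      # observed divergence
    exceed = 0
    for _ in range(num_sims):
        # Step 1: draw n i.i.d. samples from the fitted model.
        sim = [rng.expovariate(rate_hat) for _ in range(n)]
        # Steps 2 and 3: re-estimate from the simulated draws and compute
        # the divergence against the re-fitted model.
        if divergence_from_fit(sim) >= d:
            exceed += 1
    return exceed / num_sims
```

Re-estimating the parameter inside each simulation (Step 2) is what distinguishes this procedure from comparing the simulated draws against the original fit.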

Remark 3.2. The standard error of the estimate from Remark 3.1 for an exact P-value $P$
is $\sqrt{P(1-P)/\ell}$, where $\ell$ is the number of Monte-Carlo simulations conducted to produce
the estimate. Indeed, each simulation has probability $P$ of producing a divergence that is
greater than or equal to the divergence corresponding to an exact P-value of $P$. Since the
simulations are all independent, the number of the $\ell$ simulations that produce divergences
greater than or equal to that corresponding to P-value $P$ follows the binomial distribution
with $\ell$ trials and probability $P$ of success in each trial. The standard deviation of the number
of simulations whose divergences are greater than or equal to that corresponding to P-value
$P$ is therefore $\sqrt{\ell P(1-P)}$, and so the standard deviation of the fraction of the simulations
producing such divergences is $\sqrt{P(1-P)/\ell}$. Of course, the fraction itself is the Monte-Carlo
estimate of the exact P-value (we use this estimate in place of the unknown $P$ when
calculating the standard error $\sqrt{P(1-P)/\ell}$).
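
The standard error from this remark is simple to compute; the following is a small sketch in which the function name and the example numbers are hypothetical.

```python
import math


def p_value_standard_error(p_hat, num_sims):
    """Estimated standard error sqrt(p_hat * (1 - p_hat) / num_sims)."""
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    if num_sims <= 0:
        raise ValueError("num_sims must be positive")
    return math.sqrt(p_hat * (1.0 - p_hat) / num_sims)


# Hypothetical example: 37 of 1000 simulated divergences were at least
# as large as the observed divergence.
p_hat = 37 / 1000
se = p_value_standard_error(p_hat, 1000)
```

Because the standard error scales as the inverse square root of the number of simulations, quadrupling the number of simulations halves it.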

4 Goodness of fit for distributional profile 

Given observations $x_1, x_2, \dots, x_n$, we can test the goodness of fit for the model of i.i.d.
draws from a probability distribution $p_0(\theta)$, where $\theta$ is a nuisance parameter, via the null
hypothesis

$H_0$: $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\hat\theta)$
for the particular observed value of $\hat\theta = \hat\theta(x_1, x_2, \dots, x_n)$. (7)

The Neyman-Pearson formulation considers instead the null hypothesis

$H_0^{NP}$: there exists a value of $\theta$ such that $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\theta)$. (8)

The fully conditional null hypothesis is

$H_0^{FC}$: $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\hat\theta)$
and $\hat\theta = \hat\theta(x_1, x_2, \dots, x_n)$ takes the same value in all possible realizations. (9)

That is, whereas $H_0$ supposes that the particular observed realization of the experiment
happened to produce a parameter estimate $\hat\theta$ that is consistent with having drawn the data
from $p_0(\hat\theta)$, $H_0^{FC}$ assumes that every possible realization of the experiment (observed or
hypothetical) produces exactly the same parameter estimate. Few experimental apparatus
constrain the parameter estimate to always take the same (a priori unknown) value during
repetitions of the experiment, as $H_0^{FC}$ assumes. Assuming $H_0^{FC}$ amounts to conditioning on a
statistic that is minimally sufficient for estimating $\theta$; computing the associated P-values is not
always trivial. Furthermore, the assumption that $H_0^{FC}$ is true seems to be more extreme, a
more substantial departure from $H_0^{NP}$, than $H_0$. Finally, testing the significance of assuming
$H_0$ would seem to be more apropos in practice for applications in which the experimental
design does not enforce that repeated experiments always yield the same value for $p_0(\hat\theta)$. We
cannot recommend the use of $H_0^{FC}$ in general. Unfortunately, $H_0^{NP}$ also presents problems....

If the probability distributions are discrete, there is no obvious means for defining an
exact P-value for $H_0^{NP}$ when $H_0^{NP}$ is false; moreover, any P-value for $H_0^{NP}$ when $H_0^{NP}$ is
true would depend on the correct value of the parameter $\theta$, and the observed data does
not determine this value exactly. The situation may be more favorable when measuring
discrepancies with divergences that are "approximately ancillary" with respect to $\theta$, but
quantifying "approximately" seems to be problematic except in the limit of large numbers of
draws. (Some divergences are asymptotically ancillary in the limit of large numbers of draws,
but this is not especially helpful, as any asymptotically consistent estimator $\hat\theta$ converges to
the correct value in the limit of large numbers of draws; $\theta$ is almost surely known exactly in
the limit of large numbers of draws, so there is no benefit to being independent of $\theta$ in that
limit.) Section 3 of Robins and Wasserman (2000) reviews these and related issues.

Remark 4.1. Romano (1988), Henze (1996), Bickel et al. (2006), and others have shown
that the P-values for $H_0$ converge in distribution to the uniform distribution over [0, 1] in the
limit of large numbers of draws, when $H_0^{NP}$ is true. In particular, Romano (1988) and Henze
(1996) prove this convergence for a wide class of divergence measures.



Remark 4.2. The surveys of Agresti (1992) and Agresti (2001) discuss exact P-values
for contingency-tables/cross-tabulations, including criticism of fully conditional P-values.
Gelman (2003) provides further criticism of fully conditional P-values. Ward (2012)
numerically evaluates the different types of P-values for an application in population genetics.
Section 4 of Bayarri and Berger (2004) and the references it cites discuss the menagerie of
alternative P-values proposed recently.



5 Goodness of fit for various properties 

For comparative purposes, we first review the null hypothesis of the previous section for 
testing the goodness of fit for distributional profile, namely 

$H_0$: $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\hat\theta)$, where $\hat\theta = \hat\theta(x_1, x_2, \dots, x_n)$, (10)

with $\theta$ being the nuisance parameter. The measure of discrepancy for $H_0$ is usually taken to
be a divergence between the empirical distribution $\hat p$ and the model $p_0(\hat\theta)$ (in the continuous
case in one dimension, a common characterization of the empirical distribution is the empirical
cumulative distribution function; in the discrete case, a common characterization of the
empirical distribution is the empirical probability mass function, that is, the set of empirical
proportions). One example for $p_0$ is the Zipf distribution over $m$ bins with parameter $\theta$, a
discrete distribution with the probability mass function

$$p_0^{(j)}(\theta) = \frac{C_\theta}{j^\theta} \tag{11}$$

for $j = 1, 2, 3, \dots, m$, where the normalization constant is

$$C_\theta = \frac{1}{\sum_{j=1}^{m} j^{-\theta}} \tag{12}$$

and $\theta$ is a nonnegative real number.
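
As a concrete illustration, the Zipf probability mass function and its normalization constant could be computed as in the sketch below; the function name `zipf_pmf` is hypothetical.

```python
def zipf_pmf(theta, m):
    """Return [p(1), ..., p(m)] with p(j) = C_theta / j**theta.

    C_theta = 1 / sum_{j=1}^{m} j**(-theta) normalizes the masses to sum to 1.
    """
    if theta < 0:
        raise ValueError("theta must be nonnegative")
    if m < 1:
        raise ValueError("m must be at least 1")
    weights = [j ** (-theta) for j in range(1, m + 1)]
    c_theta = 1.0 / sum(weights)
    return [c_theta * w for w in weights]
```

With `theta = 0` the distribution is uniform over the `m` bins; larger values of `theta` concentrate more mass on the first bins.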

When testing the goodness of fit for parameter estimates, we use the null hypothesis 

$H_0'$: $x_1, x_2, \dots, x_n$ are i.i.d. draws from $p_0(\phi_0, \hat\theta)$, where $\hat\theta = \hat\theta(x_1, x_2, \dots, x_n)$, (13)

with $\theta$ being the nuisance parameter and $\phi$ being the parameter of interest (and with $\phi_0$
being the value of $\phi$ assumed under the model). Please note that $H_0$ and $H_0'$ are actually
equivalent, via the identification $p_0(\theta) = p_0(\phi_0, \theta)$. However, the measure of discrepancy for
$H_0'$ is usually taken to be a divergence between $\hat\phi$ and $\phi_0$ rather than the divergence between
$\hat p$ and $p_0(\hat\theta)$ that is more natural for $H_0$. Also, if $\phi$ is scalar-valued, then confidence intervals,
confidence distributions, and parametric bootstrap distributions are more informative than
a significance test. A significance test is appropriate if $\phi$ is vector-valued. One example
for $p_0$ is the sorted Zipf distribution over $m$ bins with $\theta$ being the power in the power law
and with the maximum-likelihood estimate $\hat\phi$ being a permutation that sorts the bins into
rank-order, that is, $p_0$ is the discrete distribution with the probability mass function

$$p_0^{(j)}(\phi, \theta) = \frac{C_\theta}{(\phi(j))^\theta} \tag{14}$$

for $j = 1, 2, 3, \dots, m$, where the normalization constant $C_\theta$ is defined in (12) with $\theta$ being
a nonnegative real number, and $\phi$ is a permutation of the numbers $1, 2, \dots, m$. The choice
for $\phi_0$ that is of widest interest in applications is the identity permutation (that is, the
"rearrangement" of the bins that does not permute any bins: $\phi_0(j) = j$ for $j = 1, 2, \dots, m$).

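The sorted Zipf example can be sketched as follows, representing a permutation as a Python list `phi` whose entry `phi[j - 1]` holds the rank assigned to bin `j`. The function names are hypothetical, and breaking ties between equal counts by bin index is an assumption made here for definiteness; the text does not address ties.

```python
def sorted_zipf_pmf(phi, theta):
    """Return masses C_theta / phi(j)**theta for bins j = 1, ..., m."""
    m = len(phi)
    if sorted(phi) != list(range(1, m + 1)):
        raise ValueError("phi must be a permutation of 1, ..., m")
    c_theta = 1.0 / sum(j ** (-theta) for j in range(1, m + 1))
    return [c_theta * rank ** (-theta) for rank in phi]


def rank_order_permutation(counts):
    """Permutation assigning rank 1 to the bin with the largest count.

    Ties are broken by bin index (an assumption of this sketch).
    """
    order = sorted(range(len(counts)), key=lambda j: -counts[j])
    phi = [0] * len(counts)
    for rank, j in enumerate(order, start=1):
        phi[j] = rank
    return phi


# Hypothetical counts for three bins: the second bin is the most frequent.
phi_hat = rank_order_permutation([5, 20, 10])
```

With the identity permutation, `sorted_zipf_pmf` reduces to the ordinary Zipf probability mass function over the bins in their given order.
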
When testing the goodness of fit for the standard Poisson regression with the distribution 
of $Y$ given $x$ being the Poisson distribution of mean $\exp\bigl(\theta^{(0)} + \sum_{j=1}^{m} \theta^{(j)} x^{(j)}\bigr)$, we use the
null hypothesis

$H_0''$: $y_1, y_2, \dots, y_n$ are independent draws from the Poisson distributions with means
$\exp\bigl(\hat\theta^{(0)} + \sum_{j=1}^{m} \hat\theta^{(j)} x_1^{(j)}\bigr)$, $\exp\bigl(\hat\theta^{(0)} + \sum_{j=1}^{m} \hat\theta^{(j)} x_2^{(j)}\bigr)$, $\dots$, $\exp\bigl(\hat\theta^{(0)} + \sum_{j=1}^{m} \hat\theta^{(j)} x_n^{(j)}\bigr)$,
respectively, (15)

where $\theta$ is the nuisance parameter and $\hat\theta$ is a maximum-likelihood estimate. The measure
of discrepancy for $H_0''$ is usually taken to be the log-likelihood-ratio (also known as the
deviance)

$$g^2 = 2 \sum_{k=1}^{n} y_k \ln(y_k / \mu_k), \tag{16}$$

where $\mu_k$ is the mean of the Poisson distribution associated with $y_k$ in $H_0''$, namely,

$$\mu_k = \exp\Bigl(\hat\theta^{(0)} + \sum_{j=1}^{m} \hat\theta^{(j)} x_k^{(j)}\Bigr). \tag{17}$$

One example is the cubic polynomial

$$\ln(\mu_k) = \hat\theta^{(0)} + \hat\theta^{(1)} x_k + \hat\theta^{(2)} (x_k)^2 + \hat\theta^{(3)} (x_k)^3 \tag{18}$$

for $k = 1, 2, \dots, n$, which comes from the choice $m = 3$ and

$$x_k^{(1)} = x_k, \quad x_k^{(2)} = (x_k)^2, \quad x_k^{(3)} = (x_k)^3 \tag{19}$$

for $k = 1, 2, \dots, n$, given observations as ordered pairs of scalars $(x_1, y_1), (x_2, y_2), \dots,
(x_n, y_n)$. Of course, there are similar formulations for other generalized linear models, such
as those discussed by McCullagh and Nelder (1989).
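
To show how the cubic example arises as a special case of the general linear predictor, the following sketch builds the power features and evaluates the corresponding Poisson mean. The function names and parameter values are hypothetical, and the coefficients stand in for a maximum-likelihood estimate rather than being fitted here.

```python
import math


def power_features(x, m):
    """Features x^(1), ..., x^(m) with x^(j) = x**j (cubic case: m = 3)."""
    return [x ** j for j in range(1, m + 1)]


def poisson_mean(theta_hat, features):
    """Mean exp(theta^(0) + sum_j theta^(j) * x^(j)) for one observation.

    theta_hat -- sequence (theta^(0), theta^(1), ..., theta^(m))
    features  -- sequence (x^(1), ..., x^(m))
    """
    if len(theta_hat) != len(features) + 1:
        raise ValueError("theta_hat needs one more entry than features")
    eta = theta_hat[0] + sum(t * f for t, f in zip(theta_hat[1:], features))
    return math.exp(eta)


# Hypothetical coefficients, standing in for a maximum-likelihood estimate.
theta_hat = (0.1, 0.2, 0.3, 0.4)
mu = poisson_mean(theta_hat, power_features(0.5, 3))
```

Other choices of features leave `poisson_mean` unchanged, which is the sense in which the cubic polynomial is only one instance of the general formulation.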